Abstract

In support vector machine (SVM) applications with unreliable data that contains a portion of outliers, 
non-robustness of SVMs often causes considerable performance deterioration. Although many approaches 
for improving the robustness of SVMs have been studied, two major challenges remain in robust SVM 
learning. First, robust learning algorithms are essentially formulated as non-convex optimization problems
because the loss function must be designed to alleviate the influence of outliers. It is thus important
to develop a non-convex optimization method for robust SVM that can find a good local optimal solution. 
The second practical issue is how one can tune the hyperparameter that controls the balance between 
robustness and efficiency. Unfortunately, due to the non-convexity, robust SVM solutions with slightly 
different hyper-parameter values can be significantly different, which makes model selection highly unstable.
In this paper, we address these two issues simultaneously by introducing a novel homotopy approach
to non-convex robust SVM learning. Our basic idea is to introduce parametrized formulations of robust
SVM which bridge the standard SVM and fully robust SVM via the parameter that represents
the influence of outliers. We characterize the necessary and sufficient conditions of the local optimal 
solutions of robust SVM, and develop an algorithm that can trace a path of local optimal solutions when 
the influence of outliers is gradually decreased. An advantage of our homotopy approach is that it can
be interpreted as simulated annealing, a common approach for finding a good local optimal solution in 
non-convex optimization problems. In addition, our homotopy method allows stable and efficient model 
selection based on the path of local optimal solutions. Empirical performances of the proposed approach 
are demonstrated through intensive numerical experiments both on robust classification and regression 
problems. 


1 Introduction 

The support vector machine (SVM) has been one of the most successful machine learning algorithms [1, 2, 3]. 
However, in recent practical machine learning applications with less reliable data that contains a portion of 
outliers (e.g., consider situations where the labels are automatically obtained by semi-supervised learning 
[4, 5] or manually annotated in crowdsourcing framework [6, 7]), non-robustness of the SVM often causes 
considerable performance deterioration. See Figure 1 for examples of robust classification and regression.




[Figure 1 panels: (a) Standard SVC, (b) Robust SVC, (c) Standard SVR, (d) Robust SVR]

Figure 1: Illustrative examples of (a) standard SVC (classification) and (b) robust SVC, (c) standard SVR 
(regression) and (d) robust SVR on toy dataset. In robust SVM, the classification and regression results are 
not sensitive to the two red outliers in the right-hand side of the plots. 

1.1 Limitations of Existing Robust Classification and Regression

Although a great deal of effort has been devoted to improving the robustness of SVM and other similar
learning algorithms [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], the problem of learning SVM under massive noise
still poses two major challenges.

First, robust learning algorithms are essentially formulated as non-convex optimization problems because 
the loss function must be designed to alleviate the effect of outliers (see the robust loss functions for
classification and regression problems in Figure 2 and Figure 3). Since there could be many local optimal solutions
in non-convex optimization problems, it is important to develop non-convex optimization techniques, such as
simulated annealing [19], that enable us to find a good local optimal solution.

The second practical issue is how to strike a balance between robustness and efficiency. Any robust
SVM formulation contains an additional hyper-parameter for controlling the trade-off between robustness
and efficiency. Since such a hyper-parameter governs the influence of outliers on the model, it must be
carefully tuned based on the properties of the noise contained in the data set. In practice, we need to empirically
select an appropriate value for the hyper-parameter, e.g., by cross-validation, because we usually do not 
have sufficient knowledge about the noise in advance. Furthermore, due to the non-convexity, robust SVM 
solutions with slightly different hyper-parameter values can be significantly different, which makes model 
selection highly unstable. 

In this paper, we address these two issues simultaneously by introducing a novel homotopy approach 
to robust SV classification (SVC) and SV regression (SVR) learning (see footnote 1). Our basic idea is to consider
parametrized formulations of robust SVC and SVR which bridge the standard SVM and fully robust SVM
via a parameter that governs the influence of outliers. We use homotopy methods [20, 21, 22, 23] for
tracing a path of solutions when the influence of outliers is gradually decreased. We call this parameter
the robustness parameter, and the path of solutions obtained by tracing the robustness parameter
the robustification path. Figure 2 and Figure 3 illustrate how the robust loss functions for classification and
regression problems can be gradually robustified, respectively.

1.2 Our contributions 

Our first technical contribution is in analyzing the properties of the robustification path for both classification 
and regression problems. In particular, we derive the necessary and sufficient conditions for SVC and SVR 
solutions to be locally optimal (note that the well-known Karush-Kuhn-Tucker (KKT) conditions are only
necessary, but not sufficient). Interestingly, the analyses indicate that the robustification paths contain a
finite number of discontinuous points. To the best of our knowledge, the above property of robust learning
has not been known previously. 

Footnote 1: For regression problems, we study least absolute deviation (LAD) regression. It is straightforward to extend it to the original
SVR formulation with the $\varepsilon$-insensitive loss function. To simplify the description, we often refer to LAD regression as SV
regression (SVR). In what follows, we use the term SVM when we describe common properties of SVC and SVR.

Figure 2: Robust loss functions for various homotopy parameters $\theta$ and $s$. (a) Homotopy computation with decreasing $\theta$ from 1 to 0. (b) Homotopy computation with increasing $s$ from $-\infty$ to 0.

Figure 3: Regression version of robust loss functions for various homotopy parameters $\theta$ and $s$. (a) Homotopy computation with decreasing $\theta$ from 1 to 0. (b) Homotopy computation with decreasing $s$ from $\infty$ to 0.


Our second contribution is to develop an efficient algorithm for actually computing the robustification
path based on the above theoretical investigation of the geometry of robust SVM solutions. Here, we use
the parametric programming technique [20, 21, 22, 23], which is often used for computing the regularization
path in the machine learning literature. The main technical challenge here is how to handle the discontinuous
points in the robustification path. We develop an algorithm that can precisely detect such discontinuous
points, and jump to find a strictly better local optimal solution. Unlike the solution paths of convex problems
[20, 21, 22, 23], our local optimal solution path is shown to have a finite number of discontinuous points. We
overcome this difficulty by precisely analyzing the necessary and sufficient conditions for the local optimality.

Many existing studies on robust SVM employ Concave Convex Procedure (CCCP) [24] or a variant called 
Difference of Convex (DC) programming [9, 10, 11, 12, 14, 15]. In these methods, a non-convex loss function 
is decomposed into the concave and convex parts. A local optimal solution will be obtained by iteratively 
solving a sequence of convex optimization problems. Local optimal solutions found by CCCP are sensitive to 
hyper-parameter values (for controlling the robustness and efficiency balance), which makes model selection
highly challenging. 

Experimental results indicate that our approach can find better robust SVM solutions more
efficiently than alternative approaches based on CCCP. We conjecture that there are two reasons why
favorable results can be obtained. First, the robustification path shares a similar advantage with simulated
annealing [19]. Simulated annealing is known to find better local solutions in many non-convex optimization
problems by computing a sequence of solutions along with a so-called temperature parameter. If we regard the
robustness parameter as the temperature parameter, our robustification path algorithm can be interpreted as
simulated annealing with infinitesimal step size. Another possible explanation for the favorable performance of
our method is its ability to perform stable and efficient model selection. Since our algorithm provides the path of
local solutions, unlike other non-convex optimization algorithms such as CCCP, two solutions with slightly
different robustness parameter values tend to be similar, which makes model selection stable. According to
our experiments, the generalization performance is quite sensitive to the choice of the robustness parameter.
Thus, it is important to finely tune the robustness parameter. Since our algorithm can compute the path of
solutions, it is much more computationally efficient than running CCCP many times at different robustness
parameter values.

1.3 Structure of This Paper 

After we formulate robust SVC and SVR as parametrized optimization problems in § 2, we derive in § 3 the 
necessary and sufficient conditions for a robust SVM solution to be locally optimal, and show that there exist 
a finite number of discontinuous points in the local solution path. We then propose an efficient algorithm 
in § 4 that can precisely detect such discontinuous points and jump to find a strictly better local optimal 
solution. In § 5, we experimentally demonstrate that our proposed method, named the robustification path 
algorithm, outperforms existing robust SVM algorithms based on CCCP or DC programming. Finally,
we conclude in § 6. 

This paper is an extended version of our preliminary conference paper presented at ICML 2014 [25].
In this paper, we have extended our robustification path framework to the regression problem, and many
more experimental evaluations have been conducted. To the best of our knowledge, the homotopy method
[20, 21, 22, 23] was first used in the context of robust learning in our preliminary conference paper. So far,
homotopy-like methods have been (often implicitly) used for non-convex optimization problems in the context
of sparse modeling [26, 27, 28] and semi-supervised learning [29].

2 Parameterized Formulation of Robust SVM


In this section, we first formulate robust SVMs for classification and regression problems, which we denote
by robust SVC (SV classification) and robust SVR (SV regression), respectively. Then, we introduce a parameterized
formulation for both robust SVC and SVR, where the parameter governs the influence of outliers
on the model. The problem is reduced to the ordinary non-robust SVM at one end of the parameter range, while the
problem corresponds to the fully robust SVM at the other end. In the following sections, we
develop algorithms for computing the path of local optimal solutions when the parameter is changed from
one end to the other.


2.1 Robust SV Classification 

Let us consider a binary classification problem with $n$ instances and $d$ features. We denote the training set
as $\{(x_i, y_i)\}_{i \in \mathbb{N}_n}$, where $x_i \in \mathcal{X}$ is the input vector in the input space $\mathcal{X} \subset \mathbb{R}^d$, $y_i \in \{-1, 1\}$ is the binary
class label, and the notation $\mathbb{N}_n := \{1, \ldots, n\}$ represents the set of natural numbers up to $n$. We write the
decision function as

$$f(x) := w^\top \phi(x), \tag{1}$$

where $\phi$ is the feature map implicitly defined by a kernel $K$, $w$ is a vector in the feature space, and $^\top$
denotes the transpose of vectors and matrices.

We introduce the following class of optimization problems parameterized by $\theta$ and $s$:

$$\min_{w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \ell(y_i f(x_i); \theta, s), \tag{2}$$

where $C > 0$ is the regularization parameter which controls the balance between the first regularization term
and the second loss term. The loss function $\ell$ is characterized by a pair of parameters $\theta \in [0,1]$ and $s \le 0$ as

$$\ell(z; \theta, s) := \begin{cases} [1 - z]_+, & z \ge s, \\ 1 - \theta z - s, & z < s, \end{cases} \tag{3}$$


where $[z]_+ := \max\{0, z\}$. We refer to $\theta$ and $s$ as the homotopy parameters. Figure 2 shows the loss functions
for several $\theta$ and $s$. The first homotopy parameter $\theta$ can be interpreted as the weight for an outlier: $\theta = 1$
indicates that the influence of an outlier is the same as an inlier, while $\theta = 0$ indicates that outliers are
completely ignored. The second homotopy parameter $s \le 0$ can be interpreted as the threshold for deciding
outliers and inliers.

In the following sections, we consider two types of homotopy methods. In the first method, we fix $s = 0$,
and gradually change $\theta$ from 1 to 0 (see the top five plots in Figure 2). In the second method, we fix $\theta = 0$
and gradually change $s$ from $-\infty$ to 0 (see the bottom five plots in Figure 2). Note that the loss function
is reduced to the hinge loss for the standard (convex) SVC when $\theta = 1$ or $s = -\infty$. Therefore, each of the
above two homotopy methods can be interpreted as the process of tracing a sequence of solutions when the
optimization problem is gradually modified from convex to non-convex. By doing so, we expect to find good
local optimal solutions because such a process can be interpreted as simulated annealing [19]. In addition,
we can adaptively control the degree of robustness by selecting the best $\theta$ or $s$ based on some model selection
scheme.
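
As a quick numerical illustration of the loss in (3), the following minimal Python sketch evaluates it at the two ends of each homotopy. The function names are chosen here for illustration only, and assigning the tie $z = s$ to the inlier branch is an assumption.

```python
import math


def hinge(z):
    """Standard hinge loss [1 - z]_+."""
    return max(0.0, 1.0 - z)


def robust_hinge_loss(z, theta, s):
    """Loss (3): hinge for margins z >= s, linear with slope -theta below s.

    The tie z == s is assigned to the inlier branch (an assumption).
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if s > 0.0:
        raise ValueError("s must be non-positive")
    if z >= s:
        return hinge(z)
    return 1.0 - theta * z - s


# theta = 0 with finite s: the loss is flat (equal to 1 - s) below the threshold.
flat_values = [robust_hinge_loss(z, theta=0.0, s=-1.0) for z in (-2.0, -5.0, -50.0)]

# s = -infinity: no margin falls below the threshold, so the hinge loss is recovered.
hinge_values = [robust_hinge_loss(z, theta=0.0, s=-math.inf) for z in (-3.0, 0.5, 2.0)]

# theta = 1 with s = 0 (the starting point of the first method): hinge loss again.
start_value = robust_hinge_loss(-3.0, theta=1.0, s=0.0)
```

With $\theta = 0$ the contribution of every instance below the threshold is capped, which is the sense in which outliers are ignored.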

2.2 Robust SV Regression 

Let us next consider a regression problem. We denote the training set of the regression problem as
$\{(x_i, y_i)\}_{i \in \mathbb{N}_n}$, where the input $x_i \in \mathcal{X}$ is the input vector as in the classification case, while the output
$y_i \in \mathbb{R}$ is a real scalar. We consider a regression function $f(x)$ in the form of (1). SV regression is formulated
as

$$\min_{w} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \ell(y_i - f(x_i); \theta, s), \tag{4}$$

where $C > 0$ is the regularization parameter, and the loss function $\ell$ is defined as

$$\ell(z; \theta, s) := \begin{cases} |z|, & |z| < s, \\ (|z| - s)\theta + s, & |z| \ge s. \end{cases} \tag{5}$$

The loss function in (5) has two parameters $\theta \in [0,1]$ and $s \in [0, \infty)$ as in the classification case. Figure 3
shows the loss functions for several $\theta$ and $s$.
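
The regression loss (5) can be checked numerically in the same way. The sketch below is illustrative only; the function name is hypothetical, and the case boundaries follow the reconstruction of (5) given above.

```python
def robust_abs_loss(z, theta, s):
    """Loss (5): |z| for residuals below s, slope theta beyond the threshold s."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if s < 0.0:
        raise ValueError("s must be non-negative")
    r = abs(z)
    if r < s:
        return r
    return (r - s) * theta + s


# theta = 1 recovers the absolute (LAD) loss for any threshold.
lad_value = robust_abs_loss(3.0, theta=1.0, s=1.0)

# theta = 0 truncates the loss at the threshold s.
capped_value = robust_abs_loss(3.0, theta=0.0, s=1.0)
```

The two branches agree at $|z| = s$, so the loss is continuous for every $\theta$.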


3 Local Optimality 

In order to use the homotopy approach, we need to clarify the continuity of the local solution path. To 
this end, we investigate several properties of local solutions of robust SVM, and derive the necessary and 
sufficient conditions. Interestingly, our analysis reveals that the local solution path has a finite number of 
discontinuous points. The theoretical results presented here form the basis of our novel homotopy algorithm 
given in the next section that can properly handle the above discontinuity issue. We first discuss the local 
optimality of robust SVC in detail in § 3.1 and § 3.2, and then present the corresponding result of robust 
SVR briefly in § 3.3. 

3.1 Conditionally Optimal Solutions (for Robust SVC) 

The basic idea of our theoretical analysis is to reformulate the robust SVC learning problem as a combinatorial
optimization problem. We consider a partition of the instances $\mathbb{N}_n := \{1, \ldots, n\}$ into two disjoint sets $\mathcal{I}$ and
$\mathcal{O}$. The instances in $\mathcal{I}$ and $\mathcal{O}$ are defined as Inliers and Outliers, respectively. Here, we require that the
margin $y_i f(x_i)$ of an inlier should be larger than $s$, while that of an outlier should be smaller than $s$. We
denote the partition as $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\} \in 2^{\mathbb{N}_n}$, where $2^{\mathbb{N}_n}$ is the power set (footnote 2) of $\mathbb{N}_n$. Given a partition $\mathcal{P}$, the
above restrictions define the feasible region of the solution $f$ in the form of a convex polytope (footnote 3):

$$\mathrm{pol}(\mathcal{P}; s) := \left\{ f \ \middle|\ \begin{array}{ll} y_i f(x_i) \ge s, & i \in \mathcal{I}, \\ y_i f(x_i) \le s, & i \in \mathcal{O} \end{array} \right\}. \tag{6}$$

Footnote 2: The power set means that there are $2^n$ patterns in which each of the instances belongs to either $\mathcal{I}$ or $\mathcal{O}$.

Using the notion of the convex polytopes, the optimization problem (2) can be rewritten as

$$\min_{\mathcal{P} \in 2^{\mathbb{N}_n}} \left( \min_{f \in \mathrm{pol}(\mathcal{P}; s)} J_{\mathcal{P}}(f; \theta) \right), \tag{7}$$

where the objective function $J_{\mathcal{P}}$ is defined as (footnote 4)

$$J_{\mathcal{P}}(f; \theta) := \frac{1}{2}\|w\|_2^2 + C \left( \sum_{i \in \mathcal{I}} [1 - y_i f(x_i)]_+ + \theta \sum_{i \in \mathcal{O}} [1 - y_i f(x_i)]_+ \right).$$

When the partition $\mathcal{P}$ is fixed, it is easy to confirm that the inner minimization problem of (7) is a convex
problem. 


Definition 1 (Conditionally optimal solutions) Given a partition $\mathcal{P}$, the solution of the following convex
problem is said to be the conditionally optimal solution:

$$f^*_{\mathcal{P}} := \arg\min_{f \in \mathrm{pol}(\mathcal{P}; s)} J_{\mathcal{P}}(f; \theta). \tag{8}$$

The formulation in (7) is interpreted as a combinatorial optimization problem of finding the best solution
from all the $2^n$ conditionally optimal solutions $f^*_{\mathcal{P}}$ corresponding to all possible $2^n$ partitions (footnote 5).
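
To make the role of the partition concrete, the following sketch evaluates the partition-dependent objective $J_{\mathcal{P}}(f; \theta)$ for given margins. It is an illustration under stated assumptions: the function and argument names are hypothetical, and the squared norm $\|w\|_2^2$ and the margins $y_i f(x_i)$ are assumed to have been computed elsewhere.

```python
def hinge(z):
    """Standard hinge loss [1 - z]_+."""
    return max(0.0, 1.0 - z)


def partition_objective(w_norm_sq, margins, inliers, outliers, C, theta):
    """J_P(f; theta): inlier hinge losses at full weight, outlier losses scaled by theta."""
    inlier_loss = sum(hinge(margins[i]) for i in inliers)
    outlier_loss = sum(hinge(margins[i]) for i in outliers)
    return 0.5 * w_norm_sq + C * (inlier_loss + theta * outlier_loss)


# Three instances: two inliers and one instance treated as an outlier.
margins = [1.5, 0.5, -2.0]
value_standard = partition_objective(2.0, margins, {0, 1}, {2}, C=1.0, theta=1.0)
value_robust = partition_objective(2.0, margins, {0, 1}, {2}, C=1.0, theta=0.0)
```

Lowering $\theta$ only changes the weight on the outlier set, so for a fixed partition the objective stays convex in $f$.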

Using the representer theorem or convex optimization theory, we can show that any conditionally optimal
solution can be written as

$$f^*_{\mathcal{P}}(x) := \sum_{j \in \mathbb{N}_n} \alpha^*_j y_j K(x, x_j), \tag{9}$$

where $\{\alpha^*_j\}_{j \in \mathbb{N}_n}$ are the optimal Lagrange multipliers. The following lemma summarizes the KKT optimality
conditions of the conditionally optimal solution $f^*_{\mathcal{P}}$.


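Lemma 2 The KKT conditions of the convex problem (8) are written as

\begin{align}
y_i f^*_{\mathcal{P}}(x_i) > 1 \ &\Rightarrow\ \alpha^*_i = 0, \tag{10a} \\
y_i f^*_{\mathcal{P}}(x_i) = 1 \ &\Rightarrow\ \alpha^*_i \in [0, C], \tag{10b} \\
s < y_i f^*_{\mathcal{P}}(x_i) < 1 \ &\Rightarrow\ \alpha^*_i = C, \tag{10c} \\
y_i f^*_{\mathcal{P}}(x_i) = s,\ i \in \mathcal{I} \ &\Rightarrow\ \alpha^*_i \ge C, \tag{10d} \\
y_i f^*_{\mathcal{P}}(x_i) = s,\ i \in \mathcal{O} \ &\Rightarrow\ \alpha^*_i \le C\theta, \tag{10e} \\
y_i f^*_{\mathcal{P}}(x_i) < s \ &\Rightarrow\ \alpha^*_i = C\theta. \tag{10f}
\end{align}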


Footnote 3: Note that an instance with the margin $y_i f(x_i) = s$ can be a member of either $\mathcal{I}$ or $\mathcal{O}$.

Footnote 4: Note that we omitted the constant terms irrelevant to the optimization problem.

Footnote 5: For some partitions $\mathcal{P}$, the convex problem (8) might not have any feasible solutions.



The proof is omitted because it can be easily derived based on standard convex optimization theory [30]. 


3.2 The necessary and sufficient conditions for local optimality (for Robust SVC)

From the definition of conditionally optimal solutions, it is clear that a local optimal solution must be
conditionally optimal within the convex polytope $\mathrm{pol}(\mathcal{P}; s)$. However, the conditional optimality does not
necessarily indicate the local optimality as the following theorem suggests.

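Theorem 3 For any $\theta \in [0, 1)$ and $s \le 0$, consider the situation where a conditionally optimal solution $f^*_{\mathcal{P}}$ is
at the boundary of the convex polytope $\mathrm{pol}(\mathcal{P}; s)$, i.e., there exists at least one instance such that $y_i f^*_{\mathcal{P}}(x_i) = s$.
In this situation, if we define a new partition $\tilde{\mathcal{P}} := \{\tilde{\mathcal{I}}, \tilde{\mathcal{O}}\}$ as

\begin{align}
\tilde{\mathcal{I}} &\leftarrow \mathcal{I} \setminus \{ i \in \mathcal{I} \mid y_i f^*_{\mathcal{P}}(x_i) = s \} \cup \{ i \in \mathcal{O} \mid y_i f^*_{\mathcal{P}}(x_i) = s \}, \tag{11a} \\
\tilde{\mathcal{O}} &\leftarrow \mathcal{O} \setminus \{ i \in \mathcal{O} \mid y_i f^*_{\mathcal{P}}(x_i) = s \} \cup \{ i \in \mathcal{I} \mid y_i f^*_{\mathcal{P}}(x_i) = s \}, \tag{11b}
\end{align}

then the new conditionally optimal solution $f^*_{\tilde{\mathcal{P}}}$ is strictly better than the original conditionally optimal
solution $f^*_{\mathcal{P}}$, i.e.,

$$J_{\tilde{\mathcal{P}}}(f^*_{\tilde{\mathcal{P}}}; \theta) < J_{\mathcal{P}}(f^*_{\mathcal{P}}; \theta). \tag{12}$$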

The proof is presented in Appendix A. Theorem 3 indicates that if $f^*_{\mathcal{P}}$ is at the boundary of the convex
polytope $\mathrm{pol}(\mathcal{P}; s)$, i.e., if there are one or more instances such that $y_i f^*_{\mathcal{P}}(x_i) = s$, then $f^*_{\mathcal{P}}$ is NOT locally
optimal because there is a strictly better solution on the opposite side of the boundary.
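
The partition update (11a)-(11b) amounts to moving every boundary instance to the other set. A minimal sketch follows; the function name and the numerical tolerance `tol` used to decide $y_i f(x_i) = s$ are assumptions made for illustration.

```python
def flip_boundary_instances(inliers, outliers, margins, s, tol=1e-9):
    """Apply (11a)-(11b): instances with margin equal to s switch sides.

    `tol` is a hypothetical tolerance for the floating-point equality test.
    """
    on_boundary = {i for i in inliers | outliers if abs(margins[i] - s) <= tol}
    new_inliers = (inliers - on_boundary) | (outliers & on_boundary)
    new_outliers = (outliers - on_boundary) | (inliers & on_boundary)
    return new_inliers, new_outliers


# Instance 1 sits exactly on the boundary s = 0 and is currently an inlier.
margins = [1.2, 0.0, 0.4, -1.0]
new_inliers, new_outliers = flip_boundary_instances({0, 1, 2}, {3}, margins, s=0.0)
```

After the update, the conditionally optimal solution has to be recomputed for the new partition; the function above only changes the index sets.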

The following theorem summarizes the necessary and sufficient conditions for local optimality. Note that, 
in non-convex optimization problems, the KKT conditions are necessary but not sufficient in general. 

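Theorem 4 For $\theta \in [0,1)$ and $s \le 0$,

\begin{align}
y_i f^*(x_i) > 1 \ &\Rightarrow\ \alpha^*_i = 0, \tag{13a} \\
y_i f^*(x_i) = 1 \ &\Rightarrow\ \alpha^*_i \in [0, C], \tag{13b} \\
s < y_i f^*(x_i) < 1 \ &\Rightarrow\ \alpha^*_i = C, \tag{13c} \\
y_i f^*(x_i) < s \ &\Rightarrow\ \alpha^*_i = C\theta, \tag{13d} \\
y_i f^*(x_i) &\neq s, \quad \forall i \in \mathbb{N}_n, \tag{13e}
\end{align}

are necessary and sufficient for $f^*$ to be locally optimal.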

The proof is presented in Appendix B. The condition (13e) indicates that the solution at the boundary of
the convex polytope is not locally optimal. Figure 4 illustrates when a conditionally optimal solution can be
locally optimal with a certain $\theta$ or $s$.
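
The conditions of Theorem 4 can be checked instance by instance. The sketch below is only an illustration: it assumes that the margins $y_i f^*(x_i)$ and the multipliers $\alpha^*_i$ are mutually consistent through (9), and the function name and the tolerance `tol` are hypothetical.

```python
def satisfies_theorem4(margins, alphas, C, theta, s, tol=1e-9):
    """Check (13a)-(13e) for given margins y_i f(x_i) and multipliers alpha_i."""
    for m, a in zip(margins, alphas):
        if abs(m - s) <= tol:
            return False                      # (13e) violated: instance on the boundary
        if m > 1.0 + tol:
            ok = abs(a) <= tol                # (13a)
        elif abs(m - 1.0) <= tol:
            ok = -tol <= a <= C + tol         # (13b)
        elif m > s:
            ok = abs(a - C) <= tol            # (13c): s < margin < 1
        else:
            ok = abs(a - C * theta) <= tol    # (13d): margin < s
        if not ok:
            return False
    return True


interior_ok = satisfies_theorem4([1.5, 1.0, 0.3, -0.7], [0.0, 0.4, 1.0, 0.5],
                                 C=1.0, theta=0.5, s=0.0)
on_boundary = satisfies_theorem4([1.5, 1.0, 0.0, -0.7], [0.0, 0.4, 1.0, 0.5],
                                 C=1.0, theta=0.5, s=0.0)
```

In the second call the third instance has margin exactly $s$, so the check fails regardless of the multipliers.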

Theorem 4 suggests that, whenever the local solution path computed by the homotopy approach encounters
a boundary of the current convex polytope at a certain $\theta$ or $s$, the solution is no longer locally
optimal. In such cases, we need to find a new local optimal solution at that $\theta$ or $s$, and restart the
local solution path from the new one. In other words, the local solution path has a discontinuity at that $\theta$ or
$s$. Fortunately, Theorem 3 tells us how to handle such a situation. If the local solution path arrives at the
boundary, it can jump to the new conditionally optimal solution which is located on the opposite side of
the boundary. This jump operation is justified because the new solution is shown to be strictly better than
the previous one. Figure 4 (c) and (d) illustrate such a situation.





Figure 4: Solution space of robust SVC. (a) The arrows indicate a local solution path when $\theta$ is gradually
moved from $\theta_1$ to $\theta_5$ (see § 4 for more details). (b) $f^*_{\mathcal{P}}$ is locally optimal if it is at the strict interior of the
convex polytope $\mathrm{pol}(\mathcal{P}; s)$. (c) If $f^*_{\mathcal{P}}$ exists at the boundary, then $f^*_{\mathcal{P}}$ is feasible, but not locally optimal. A
new convex polytope $\mathrm{pol}(\tilde{\mathcal{P}}; s)$ defined on the opposite side of the boundary is shown in yellow. (d) A strictly
better solution exists in $\mathrm{pol}(\tilde{\mathcal{P}}; s)$.

3.3 Local optimality of SV Regression

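In order to derive the necessary and sufficient conditions of the local optimality in robust SVR, with abuse
of notation, let us consider a partition of the instances $\mathbb{N}_n$ into two disjoint sets $\mathcal{I}$ and $\mathcal{O}$, which represent
inliers and outliers, respectively. In regression problems, an instance $(x_i, y_i)$ is regarded as an outlier if the
absolute residual $|y_i - f(x_i)|$ is sufficiently large. Thus, we define inliers and outliers of the regression problem
as

$$\mathcal{I} := \{ i \in \mathbb{N}_n \mid |y_i - f(x_i)| < s \}, \qquad \mathcal{O} := \{ i \in \mathbb{N}_n \mid |y_i - f(x_i)| > s \}.$$

Given a partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\} \in 2^{\mathbb{N}_n}$, the feasible region of the solution $f$ is represented as a convex polytope:

$$\mathrm{pol}(\mathcal{P}; s) := \left\{ f \ \middle|\ \begin{array}{ll} |y_i - f(x_i)| \le s, & i \in \mathcal{I}, \\ |y_i - f(x_i)| \ge s, & i \in \mathcal{O} \end{array} \right\}. \tag{14}$$

Then, as in the classification case, the optimization problem (4) can be rewritten as

$$\min_{\mathcal{P}} \left( \min_{f \in \mathrm{pol}(\mathcal{P}; s)} J_{\mathcal{P}}(f; \theta) \right), \tag{15}$$

where the objective function $J_{\mathcal{P}}$ is defined as

$$J_{\mathcal{P}}(f; \theta) := \frac{1}{2}\|w\|_2^2 + C \left( \sum_{i \in \mathcal{I}} |y_i - f(x_i)| + \theta \sum_{i \in \mathcal{O}} |y_i - f(x_i)| \right).$$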

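Since the inner problem of (15) is a convex problem, any conditionally optimal solution can be written as

$$f^*_{\mathcal{P}}(x) := \sum_{j \in \mathbb{N}_n} \alpha^*_j K(x, x_j). \tag{16}$$

The KKT conditions of $f^*_{\mathcal{P}}(x)$ are written as

\begin{align}
|y_i - f^*_{\mathcal{P}}(x_i)| = 0 \ &\Rightarrow\ 0 \le |\alpha^*_i| \le C, \tag{17a} \\
0 < |y_i - f^*_{\mathcal{P}}(x_i)| < s \ &\Rightarrow\ |\alpha^*_i| = C, \tag{17b} \\
|y_i - f^*_{\mathcal{P}}(x_i)| = s,\ i \in \mathcal{I} \ &\Rightarrow\ |\alpha^*_i| \ge C, \tag{17c} \\
|y_i - f^*_{\mathcal{P}}(x_i)| = s,\ i \in \mathcal{O} \ &\Rightarrow\ |\alpha^*_i| \le \theta C, \tag{17d} \\
|y_i - f^*_{\mathcal{P}}(x_i)| > s \ &\Rightarrow\ |\alpha^*_i| = \theta C. \tag{17e}
\end{align}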


Based on the same discussion as §3.2, the necessary and sufficient conditions for the local optimality of 
robust SVR are summarized as the following theorem: 


Theorem 5 For $\theta \in [0,1)$ and $s \ge 0$,

\begin{align}
|y_i - f^*(x_i)| = 0 \ &\Rightarrow\ 0 \le |\alpha^*_i| \le C, \tag{18a} \\
0 < |y_i - f^*(x_i)| < s \ &\Rightarrow\ |\alpha^*_i| = C, \tag{18b} \\
|y_i - f^*(x_i)| > s \ &\Rightarrow\ |\alpha^*_i| = \theta C, \tag{18c} \\
|y_i - f^*(x_i)| &\neq s, \quad \forall i \in \mathbb{N}_n, \tag{18d}
\end{align}

are necessary and sufficient for $f^*$ to be locally optimal.

We omit the proof of this theorem because it can be easily derived in the same way as Theorem 4.

Algorithm 1 Outlier Path Algorithm
 1: Initialize the solution $f$ by solving the standard SVM.
 2: Initialize the partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\}$ as follows:
      $\mathcal{I} \leftarrow \{ i \in \mathbb{N}_n \mid y_i f(x_i) \ge s \}$,
      $\mathcal{O} \leftarrow \{ i \in \mathbb{N}_n \mid y_i f(x_i) < s \}$.
 3: $\theta \leftarrow 1$ for OP-$\theta$; $s \leftarrow \min_{i \in \mathbb{N}_n} y_i f(x_i)$ for OP-$s$.
 4: while $\theta > 0$ for OP-$\theta$; $s < 0$ for OP-$s$ do
 5:   if ($y_i f(x_i) \neq s \ \forall i \in \mathbb{N}_n$) then
 6:     Run C-step.
 7:   else
 8:     Run D-step.
 9:   end if
10: end while

4 Outlier Path Algorithm 

Based on the analysis presented in the previous section, we develop a novel homotopy algorithm for robust
SVM. We call the proposed method the outlier-path (OP) algorithm. For simplicity, we consider homotopy
path computation involving either $\theta$ or $s$, and denote the former as OP-$\theta$ and the latter as OP-$s$. OP-$\theta$
computes the local solution path when $\theta$ is gradually decreased from 1 to 0 with fixed $s = 0$, while OP-$s$
computes the local solution path when $s$ is gradually increased from $-\infty$ to 0 with fixed $\theta = 0$.

The local optimality of robust SVM in the previous section shows that the path of local optimal solutions
has a finite number of discontinuous points, at which condition (13e) or (18d) is violated. Below, we introduce
an algorithm that appropriately handles those discontinuous points. In this section, we only describe the
algorithm for robust SVC. All the methodologies described in this section can be easily extended to the
robust SVR counterpart.

4.1 Overview 

The main flow of the OP algorithm is described in Algorithm 1. The solution $f$ is initialized by solving
the standard (convex) SVM, and the partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\}$ is defined to satisfy the constraints in (6). The
algorithm mainly switches between two steps called the continuous step (C-step) and the discontinuous step
(D-step).

In the C-step (Algorithm 2), a continuous path of local solutions is computed for a sequence of gradually
decreasing $\theta$ (or increasing $s$) within the convex polytope $\mathrm{pol}(\mathcal{P}; s)$ defined by the current partition $\mathcal{P}$. If the
local solution path encounters a boundary of the convex polytope, i.e., if there exists at least one instance
such that $y_i f(x_i) = s$, then the algorithm stops updating $\theta$ (or $s$) and enters the D-step.

In the D-step (Algorithm 3), a better local solution is obtained for fixed $\theta$ (or $s$) by solving a convex
problem defined over another convex polytope on the opposite side of the boundary (see Figure 4(d)). If the
new solution is again at a boundary of the new polytope, the algorithm repeatedly calls the D-step until it
finds the solution in the strict interior of the current polytope.

Algorithm 2 Continuous Step (C-step)
 1: while ($y_i f(x_i) \neq s \ \forall i \in \mathbb{N}_n$) do
 2:   Solve the sequence of convex problems
        $\min_{f \in \mathrm{pol}(\mathcal{P}; s)} J_{\mathcal{P}}(f; \theta)$
      for gradually decreasing $\theta$ in OP-$\theta$ or gradually increasing $s$ in OP-$s$.
 3: end while

Algorithm 3 Discontinuous Step (D-step)
 1: Update the partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\}$ as follows:
      $\mathcal{I} \leftarrow \mathcal{I} \setminus \{ i \in \mathcal{I} \mid y_i f(x_i) = s \} \cup \{ i \in \mathcal{O} \mid y_i f(x_i) = s \}$,
      $\mathcal{O} \leftarrow \mathcal{O} \setminus \{ i \in \mathcal{O} \mid y_i f(x_i) = s \} \cup \{ i \in \mathcal{I} \mid y_i f(x_i) = s \}$.
 2: Solve the following convex problem for fixed $\theta$ and $s$:
      $\min_{f \in \mathrm{pol}(\mathcal{P}; s)} J_{\mathcal{P}}(f; \theta)$.
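
The alternation between the two steps in Algorithm 1 can be sketched as a small driver loop. This is a control-flow illustration only: the four callables (`c_step`, `d_step`, `on_boundary`, `finished`) are hypothetical hooks that a caller would have to implement, for example with a parametric QP routine and a QP solver, and the toy hooks below do not solve any optimization problem.

```python
def outlier_path(state, c_step, d_step, on_boundary, finished):
    """Alternate C-steps and D-steps until the homotopy parameter reaches its end.

    All four callables are hypothetical hooks supplied by the caller.
    """
    trace = []
    while not finished(state):
        if on_boundary(state):
            state = d_step(state)   # jump to the polytope across the boundary
            trace.append("D")
        else:
            state = c_step(state)   # follow the continuous path inside the polytope
            trace.append("C")
    return state, trace


# Toy hooks: the state is (theta, boundary_flag); each C-step lowers theta by 0.5
# and pretends to hit a boundary unless the path has ended.
def toy_c_step(state):
    new_theta = max(0.0, state[0] - 0.5)
    return (new_theta, new_theta > 0.0)


def toy_d_step(state):
    return (state[0], False)


final_state, trace = outlier_path(
    (1.0, False), toy_c_step, toy_d_step,
    on_boundary=lambda st: st[1],
    finished=lambda st: st[0] <= 0.0,
)
```

Keeping the steps behind callables mirrors the observation in the text that any homotopy routine and any QP solver can be plugged in.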

The C-step can be implemented by any homotopy algorithm for solving a sequence of quadratic problems
(QP). In OP-$\theta$, the local solution path can be exactly computed because the path within a convex polytope
can be represented as piecewise-linear functions of the homotopy parameter $\theta$. In OP-$s$, the C-step is trivial
because the optimal solution is shown to be constant within a convex polytope. In § 4.2 and § 4.3, we will
describe the details of our implementation of the C-step for OP-$\theta$ and OP-$s$, respectively.

In the D-step, we only need to solve a single quadratic problem (QP). Any QP solver can be used in
this step. We note that the warm-start approach [31] is quite helpful in the D-step because the difference
between two conditionally optimal solutions in two adjacent convex polytopes is typically very small. In
§ 4.4, we describe the details of our implementation of the D-step. Figure 5 illustrates an example of the
local solution path obtained by OP-$\theta$.

Figure 5: An example of the local solution path by OP-$\theta$ on a simple toy data set (with $C = 200$). The
paths of five Lagrange multipliers $\alpha^*_1, \ldots, \alpha^*_5$ are plotted in the range of $\theta \in [0,1]$. Open circles represent
the discontinuous points in the path. In this simple example, we encountered three discontinuous points
at $\theta = 0.37$, $0.67$, and $0.77$.

4.2 Continuous-Step for OP-$\theta$

In the C-step, the partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\}$ is fixed, and our task is to solve a sequence of convex quadratic
problems (QPs) parameterized by $\theta$ within the convex polytope $\mathrm{pol}(\mathcal{P}; s)$. It has been known in the optimization
literature that a certain class of parametric convex QP can be exactly solved by exploiting the piecewise
linearity of the solution path [23]. We can easily show that the local solution path of OP-$\theta$ within a convex
polytope is also represented as a piecewise-linear function of $\theta$. The algorithm presented here is similar to
the regularization path algorithm for SVM given in [32].

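Let us consider a partition of the inliers in $\mathcal{I}$ into the following three disjoint sets:

$$\mathcal{R} := \{ i \mid 1 < y_i f(x_i) \}, \qquad \mathcal{E} := \{ i \mid y_i f(x_i) = 1 \}, \qquad \mathcal{L} := \{ i \mid s < y_i f(x_i) < 1 \}.$$

For a given fixed partition $\{\mathcal{R}, \mathcal{E}, \mathcal{L}, \mathcal{O}\}$, the KKT conditions of the convex problem (8) indicate that

$$\alpha_i = 0 \ \ \forall i \in \mathcal{R}, \qquad \alpha_i = C \ \ \forall i \in \mathcal{L}, \qquad \alpha_i = C\theta \ \ \forall i \in \mathcal{O}.$$

The KKT conditions also imply that the remaining Lagrange multipliers $\{\alpha_i\}_{i \in \mathcal{E}}$ must satisfy the following
linear system of equations:

$$y_i f(x_i) = \sum_{j \in \mathbb{N}_n} \alpha_j y_i y_j K(x_i, x_j) = 1 \ \ \forall i \in \mathcal{E} \quad \Longleftrightarrow \quad Q_{\mathcal{E}\mathcal{E}} \alpha_{\mathcal{E}} = \mathbf{1} - Q_{\mathcal{E}\mathcal{L}} \mathbf{1} C - Q_{\mathcal{E}\mathcal{O}} \mathbf{1} C\theta, \tag{19}$$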
where $Q \in \mathbb{R}^{n \times n}$ is an $n \times n$ matrix whose $(i,j)$-th entry is defined as $Q_{ij} := y_i y_j K(x_i, x_j)$. Here, a notation
such as $Q_{\mathcal{E}\mathcal{L}}$ represents a submatrix of $Q$ having only the rows in the index set $\mathcal{E}$ and the columns in the
index set $\mathcal{L}$. By solving the linear system of equations (19), the Lagrange multipliers $\alpha_i$, $i \in \mathbb{N}_n$, can be
written as an affine function of $\theta$.
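
To make the affine dependence explicit, (19) can be solved for the multipliers on the margin. The display below is a worked step rather than a statement from the analysis above: it assumes that $Q_{\mathcal{E}\mathcal{E}}$ is invertible and reads $\mathbf{1}$ as the all-ones vector of the appropriate length.

```latex
\alpha_{\mathcal{E}}(\theta)
  = Q_{\mathcal{E}\mathcal{E}}^{-1}\bigl(\mathbf{1} - C\, Q_{\mathcal{E}\mathcal{L}}\mathbf{1}\bigr)
    - \theta\, C\, Q_{\mathcal{E}\mathcal{E}}^{-1} Q_{\mathcal{E}\mathcal{O}}\mathbf{1},
\qquad
\frac{\mathrm{d}\alpha_{\mathcal{E}}}{\mathrm{d}\theta}
  = - C\, Q_{\mathcal{E}\mathcal{E}}^{-1} Q_{\mathcal{E}\mathcal{O}}\mathbf{1}.
```

The first term does not depend on $\theta$ and the second is linear in $\theta$, so within a fixed partition all multipliers move along straight lines as $\theta$ decreases.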

Noting that $y_i f(x_i) = \sum_{j \in \mathbb{N}_n} \alpha_j y_i y_j K(x_i, x_j)$ is also represented as an affine function of $\theta$, any changes
of the partition $\{\mathcal{R}, \mathcal{E}, \mathcal{L}\}$ can be exactly identified when the homotopy parameter $\theta$ is continuously decreased.
Since the solution path changes linearly for each partition of $\{\mathcal{R}, \mathcal{E}, \mathcal{L}\}$, the entire path is represented as a
continuous piecewise-linear function of the homotopy parameter $\theta$. We refer to the points in $\theta \in [0,1)$ at
which members of the sets $\{\mathcal{R}, \mathcal{E}, \mathcal{L}\}$ change as break-points $\theta_{\mathrm{BP}}$.

Using the piecewise-linearity of $y_i f(x_i)$, we can also identify when we should switch to the D-step. Once
we detect an instance satisfying $y_i f(x_i) = s$, we exit the C-step and enter the D-step.
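
Because each margin is affine in $\theta$ within a fixed partition, the next event can be found in closed form. The sketch below assumes that the margins have already been expressed as $a_i + b_i \theta$; the function name, the argument names, and the tolerance are hypothetical. Calling it with `level=s` detects the switch to the D-step, and calling it with `level=1.0` detects a break-point.

```python
def next_crossing(theta_now, intercepts, slopes, level, tol=1e-12):
    """Largest theta in [0, theta_now) at which some margin a_i + b_i * theta equals `level`.

    Returns (theta, index) for the first crossing met while decreasing theta,
    or None if no margin reaches `level` before theta = 0.
    """
    best = None
    for i, (a, b) in enumerate(zip(intercepts, slopes)):
        if abs(b) <= tol:
            continue                      # margin does not depend on theta
        theta_hit = (level - a) / b
        if 0.0 <= theta_hit < theta_now - tol:
            if best is None or theta_hit > best[0]:
                best = (theta_hit, i)
    return best


# Margins of three instances as affine functions of theta (illustrative numbers).
intercepts = [-0.5, 0.2, 0.4]
slopes = [1.0, 0.5, 0.0]
hit_s = next_crossing(1.0, intercepts, slopes, level=0.0)
hit_one = next_crossing(1.0, intercepts, slopes, level=1.0)
```

An instance that already sits on the level at `theta_now` is skipped by the strict inequality, so the routine reports only events that lie strictly ahead on the path.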

4.3 Continuous-Step for OP-s 

Since $\theta$ is fixed to 0 in OP-$s$, the KKT conditions (10) yield

$$\alpha^*_i = 0 \quad \forall i \in \mathcal{O}.$$

This means that outliers have no influence on the solution and thus the conditionally optimal solution
does not change with $s$ as long as the partition $\mathcal{P}$ is unchanged. The only task in the C-step for OP-$s$ is
therefore to find the next $s$ that changes the partition $\mathcal{P}$. Such $s$ can be simply found as

$$s \leftarrow \min_{i \in \mathcal{I}} y_i f(x_i).$$
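
The C-step of OP-$s$ therefore reduces to a single minimum. The sketch below assumes that the minimum is taken over the current inliers, which is our reading of the update above; the function name is hypothetical.

```python
def next_threshold(margins, inliers):
    """Next s in the C-step of OP-s: the smallest margin among the current inliers."""
    return min(margins[i] for i in inliers)


# Instance 3 is already an outlier; instance 1 is the inlier with the smallest margin.
margins = [1.3, -0.4, 0.2, -2.0]
s_next = next_threshold(margins, inliers={0, 1, 2})

# Once instance 1 has been moved to the outliers, no inlier margin is negative.
s_after = next_threshold(margins, inliers={0, 2})
```

In this toy example the path would continue at $s = -0.4$ with instance 1 on the boundary, and it would end afterwards because the next candidate is not below 0.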

4.4 Discontinuous-Step (for both OP-$\theta$ and OP-$s$)

As mentioned before, any convex QP solver can be used for the D-step. When the algorithm enters the
D-step, we have the conditionally optimal solution $f^*_{\mathcal{P}}$ for the partition $\mathcal{P} := \{\mathcal{I}, \mathcal{O}\}$. Our task here is to find
another conditionally optimal solution $f^*_{\tilde{\mathcal{P}}}$ for $\tilde{\mathcal{P}} := \{\tilde{\mathcal{I}}, \tilde{\mathcal{O}}\}$ given by (11).

Given that the difference between the two solutions $f^*_{\mathcal{P}}$ and $f^*_{\tilde{\mathcal{P}}}$ is typically small, the D-step can be
efficiently implemented by a technique used in the context of incremental learning [33].

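Let us define

$$\Delta_{\mathcal{I} \to \mathcal{O}} := \{ i \in \mathcal{I} \mid y_i f_{\mathcal{P}}(x_i) = s \},$$

$$\Delta_{\mathcal{O} \to \mathcal{I}} := \{ i \in \mathcal{O} \mid y_i f_{\mathcal{P}}(x_i) = s \}.$$

Then, we consider the following parameterized problem with parameter $\mu \in [0,1]$:

$$f_{\tilde{\mathcal{P}}}(x_i; \mu) := f_{\tilde{\mathcal{P}}}(x_i) + \mu \Delta f_i \quad \forall i \in \mathbb{N}_n,$$

where

$$\Delta f_i := y_i \begin{bmatrix} K_{i,\Delta_{\mathcal{I} \to \mathcal{O}}} & K_{i,\Delta_{\mathcal{O} \to \mathcal{I}}} \end{bmatrix} \begin{bmatrix} \alpha^{(\mathrm{bef})}_{\Delta_{\mathcal{I} \to \mathcal{O}}} - \mathbf{1} C \theta \\ \alpha^{(\mathrm{bef})}_{\Delta_{\mathcal{O} \to \mathcal{I}}} - \mathbf{1} C \end{bmatrix},$$

and $\alpha^{(\mathrm{bef})}$ is the corresponding $\alpha$ at the beginning of the D-step. We can show that $f_{\tilde{\mathcal{P}}}(x_i; \mu)$ reduces to
$f_{\mathcal{P}}(x_i)$ when $\mu = 1$, while it reduces to $f_{\tilde{\mathcal{P}}}(x_i)$ when $\mu = 0$, for all $i \in \mathbb{N}_n$. By using a technique similar to
incremental learning [33], we can efficiently compute the path of solutions when $\mu$ is continuously changed
from 1 to 0. This algorithm behaves similarly to the C-step in OP-θ. The implementation detail of the
D-step is described in Appendix C.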
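
To make the correction term of the D-step concrete, the sketch below evaluates $\Delta f_i$ as a kernel-weighted sum over the instances that change sides, following the definition in this section as written. The function names `delta_f` and `f_mu`, the dense kernel matrix `K`, and the index arguments are illustrative assumptions; the paper's actual implementation relies on incremental-learning updates rather than this direct evaluation.

```python
import numpy as np


def delta_f(K, y, alpha_bef, idx_i_to_o, idx_o_to_i, C, theta):
    """Delta f_i for every instance i.

    K          : (n, n) kernel matrix (assumed precomputed)
    y          : (n,) labels in {-1, +1}
    alpha_bef  : (n,) coefficients at the beginning of the D-step
    idx_i_to_o : indices moving from inliers to outliers
    idx_o_to_i : indices moving from outliers to inliers
    """
    K = np.asarray(K, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha_bef = np.asarray(alpha_bef, dtype=float)
    io = np.asarray(idx_i_to_o, dtype=int)
    oi = np.asarray(idx_o_to_i, dtype=int)
    d_io = alpha_bef[io] - C * theta  # new outliers are compared with C*theta
    d_oi = alpha_bef[oi] - C          # new inliers are compared with C
    return y * (K[:, io] @ d_io + K[:, oi] @ d_oi)


def f_mu(f_new, df, mu):
    """Parameterized decision values f(x_i; mu) = f_new(x_i) + mu * df_i."""
    return np.asarray(f_new, dtype=float) + mu * np.asarray(df, dtype=float)
```

At `mu = 0` the correction vanishes and `f_mu` returns the decision values of the new partition, which matches the end point of the path described above.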


5 Numerical Experiments 

In this section, we compare the proposed outlier-path (OP) algorithm with the conventional concave-convex
procedure (CCCP) [24] because, in most of the existing robust SVM studies, the non-convex optimization problem for
robust SVM training is solved by CCCP or a variant called difference of convex (DC) programming [9, 10,
11, 12, 14, 15].

5.1 Setup 

We used several benchmark data sets listed in Tables 1 and 2. We randomly divided each data set into training
(40%), validation (30%), and test (30%) sets for the purposes of optimization, model selection (including the
selection of $\theta$ or $s$), and performance evaluation, respectively. For robust SVC, we randomly flipped 15% of
the labels in the training and the validation data sets. For robust SVR, we first preprocessed the input and
output variables; each input variable was normalized so that the minimum and the maximum values are
$-1$ and $+1$, respectively, while the output variable was standardized to have mean zero and variance one.
Then, for 5% of the training and the validation instances, we added uniform noise $U(-2, 2)$ to the input
variable and Gaussian noise $N(0, 10^2)$ to the output variable, where $U(a,b)$ denotes the uniform distribution
between $a$ and $b$ and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$.
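
The data preparation described above can be sketched as follows. This is only an illustration: the helper names are hypothetical, the contaminated instances are drawn uniformly at random, and the input noise is assumed to be added independently to every input coordinate, which the text does not state explicitly.

```python
import numpy as np


def split_indices(n, rng):
    """Random 40% / 30% / 30% split into train, validation, test indices."""
    perm = rng.permutation(n)
    n_tr = int(round(0.4 * n))
    n_va = int(round(0.3 * n))
    return perm[:n_tr], perm[n_tr:n_tr + n_va], perm[n_tr + n_va:]


def flip_labels(y, frac, rng):
    """Flip the sign of a random fraction of the labels (robust SVC setup)."""
    y = np.array(y, dtype=float, copy=True)
    k = int(round(frac * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)
    y[idx] = -y[idx]
    return y


def contaminate_regression(X, y, frac, rng):
    """Add U(-2, 2) input noise and N(0, 10^2) output noise (robust SVR setup)."""
    X = np.array(X, dtype=float, copy=True)
    y = np.array(y, dtype=float, copy=True)
    k = int(round(frac * len(y)))
    idx = rng.choice(len(y), size=k, replace=False)
    X[idx] += rng.uniform(-2.0, 2.0, size=(k, X.shape[1]))
    y[idx] += rng.normal(0.0, 10.0, size=k)  # standard deviation 10
    return X, y
```

In this sketch `flip_labels(y, 0.15, rng)` and `contaminate_regression(X, y, 0.05, rng)` would be applied to the training and validation parts only, leaving the test part clean.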

5.2 Generalization Performance 

First, we compared the generalization performance. We used the linear kernel and the radial basis function
(RBF) kernel defined as $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, where $\gamma$ is a kernel parameter fixed to $\gamma = 1/d$
with $d$ being the input dimensionality. Model selection was carried out by finding the best hyper-parameter
combination that minimizes the validation error. We have a pair of hyper-parameters in each setup. In all
the setups, the regularization parameter $C$ was chosen from $\{0.01, 0.1, 1, 10, 100\}$, while the candidates of
the homotopy parameters were chosen as follows:
Table 1: Benchmark data sets for robust SVC experiments

       Data                        n       d
D1     BreastCancerDiagnostic      569     30
D2     AustralianCredit            690     14
D3     GermanNumer                 1000    24
D4     SVMGuide1                   3089    4
D5     spambase                    4601    57
D6     musk                        6598    166
D7     gisette                     6000    5000
D8     w5a                         9888    300
D9     a6a                         11220   122
D10    a7a                         16100   122

n = # of instances, d = input dimension


Table 2: Benchmark data sets for robust SVR experiments

       Data                        n       d
D1     bodyfat                     252     14
D2     yacht_hydrodynamics         308     6
D3     mpg                         392     7
D4     housing                     506     13
D5     mg                          1385    6
D6     winequality-red             1599    11
D7     winequality-white           4898    11
D8     space_ga                    3107    6
D9     abalone                     4177    8
D10    cpusmall                    8192    12
D11    cadata                      20640   8

n = # of instances, d = input dimension


• In OP-θ, the set of break-points $\theta_{\mathrm{BP}} \in [0,1]$ was considered as the candidates (note that the local
solutions at each break-point have already been computed in the homotopy computation).

• In OP-s, the set of break-points in $[s_C, 0]$ was used as the candidates for robust SVC, where

$$s_C := \min_{i \in \mathbb{N}_n} y_i f_{\mathrm{SVC}}(x_i)$$

with $f_{\mathrm{SVC}}$ being the ordinary non-robust SVC. For robust SVR, the set of break-points in $[s_R, 0.2 s_R]$
was used as the candidates, where

$$s_R := \max_{i \in \mathbb{N}_n} | y_i - f_{\mathrm{SVR}}(x_i) |$$

with $f_{\mathrm{SVR}}$ being the ordinary non-robust SVR.

• In CCCP-θ, the homotopy parameter $\theta$ was selected from

$$\theta \in \{1, 0.75, 0.5, 0.25, 0\}.$$

• In CCCP-s, the homotopy parameter $s$ was selected from

$$s \in \{s_C, 0.75 s_C, 0.5 s_C, 0.25 s_C, 0\}$$

for robust SVC, while it was selected from

$$s \in \{s_R, 0.8 s_R, 0.6 s_R, 0.4 s_R, 0.2 s_R\}$$

for robust SVR.
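
The RBF kernel with $\gamma = 1/d$ and the fixed candidate grids used for CCCP can be sketched as below. The function names are hypothetical, and `f_svc` / `f_svr` stand for decision values of an ordinary non-robust SVM that is assumed to have been trained beforehand.

```python
import numpy as np


def rbf_kernel(X1, X2):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) with gamma = 1 / d."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    gamma = 1.0 / X1.shape[1]
    sq = (X1 ** 2).sum(1)[:, None] + (X2 ** 2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off


def s_c(y, f_svc):
    """s_C: smallest margin y_i f_SVC(x_i) of the non-robust SVC."""
    return float(np.min(np.asarray(y, dtype=float) * np.asarray(f_svc, dtype=float)))


def s_r(y, f_svr):
    """s_R: largest absolute residual |y_i - f_SVR(x_i)| of the non-robust SVR."""
    return float(np.max(np.abs(np.asarray(y, dtype=float) - np.asarray(f_svr, dtype=float))))


def cccp_theta_grid():
    return [1.0, 0.75, 0.5, 0.25, 0.0]


def cccp_s_grid_svc(sc):
    return [c * sc for c in (1.0, 0.75, 0.5, 0.25, 0.0)]


def cccp_s_grid_svr(sr):
    return [c * sr for c in (1.0, 0.8, 0.6, 0.4, 0.2)]
```

Each grid has five values, so together with the five values of $C$ CCCP is run on 25 hyper-parameter pairs per setup, whereas OP takes its candidates directly from the break-points of the computed path.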

Note that both OP and CCCP were initialized by using the solution of standard SVM. 

Tables 3-6 report the mean and the standard deviation of the test errors over 10 different random
data splits. These results indicate that our proposed OP algorithm tends to find better local solutions and
that the degree of robustness was appropriately controlled.


Table 3: The mean of test error by 0-1 loss and standard deviation (linear, robust SVC). Smaller test error
is better. An asterisk marks the better method in terms of the test error.

Data   C-SVC        CCCP-θ        OP-θ          CCCP-s        OP-s
D1     .056(.016)   .050(.014)    .049(.016)*   .055(.018)    .050(.016)*
D2     .151(.018)   .145(.007)*   .151(.018)    .145(.007)*   .152(.010)
D3     .281(.028)   .270(.033)    .270(.023)    .262(.013)*   .266(.013)
D4     .066(.007)   .047(.007)    .047(.005)    .053(.010)    .042(.006)*
D5     .108(.010)   .088(.009)    .088(.009)    .088(.010)    .084(.007)*
D6     .072(.005)   .058(.006)*   .064(.003)    .061(.007)    .060(.003)*
D7     .185(.013)   .184(.010)    .184(.010)    .184(.010)    .184(.010)
D8     .020(.002)   .020(.003)    .020(.002)    .021(.003)    .020(.003)*
D9     .173(.004)   .181(.009)    .173(.005)*   .165(.004)    .164(.004)*
D10    .173(.008)   .176(.006)    .173(.007)*   .160(.004)*   .161(.005)


Table 4: The mean of test error by 0-1 loss and standard deviation (RBF, robust SVC).

Data   C-SVC        CCCP-θ        OP-θ          CCCP-s        OP-s
D1     .055(.017)   .043(.022)    .042(.017)*   .037(.016)*   .038(.013)
D2     .149(.010)   .148(.010)    .147(.010)*   .146(.013)    .142(.013)*
D3     .276(.024)   .267(.026)    .266(.024)*   .271(.015)    .261(.020)*
D4     .052(.009)   .048(.009)    .044(.006)*   .047(.008)    .040(.005)*
D5     .117(.012)   .109(.013)    .107(.012)*   .107(.011)    .094(.008)*
D6     .046(.007)   .045(.007)    .045(.007)    .045(.007)    .043(.006)*
D7     .044(.003)   .044(.003)    .044(.003)    .044(.003)    .044(.003)
D8     .022(.003)   .022(.003)    .022(.003)    .022(.003)    .021(.002)*
D9     .169(.003)   .170(.005)    .169(.004)*   .168(.005)    .162(.003)*
D10    .163(.003)   .163(.003)    .163(.003)    .162(.002)    .160(.004)*

* = better method in terms of the test error
Table 5: The mean of $L_1$ test error and standard deviation (linear, robust SVR).

Data   C-SVR        CCCP-θ        OP-θ          CCCP-s        OP-s
D1     .442(.324)   .337(.347)    .319(.353)*   .414(.341)    .276(.321)*
D2     .470(.053)   .487(.086)    .474(.087)*   .490(.108)    .484(.104)*
D3     .414(.038)   .351(.025)    .350(.036)*   .414(.105)    .372(.043)*
D4     .548(.180)   .520(.193)    .510(.146)*   .562(.210)*   .596(.297)
D5     .539(.019)   .531(.019)    .530(.017)*   .539(.024)    .529(.018)*
D6     .685(.028)   .664(.026)    .655(.027)*   .685(.044)*   .686(.040)
D7     .700(.016)   .691(.017)    .685(.017)*   .698(.022)    .692(.014)*
D8     .582(.027)   .583(.042)    .570(.031)*   .589(.035)    .569(.028)*
D9     .518(.015)   .510(.019)    .501(.021)*   .522(.026)    .516(.019)*
D10    .281(.021)   .278(.016)*   .279(.016)    .269(.018)    .269(.021)
D11    .494(.010)   .488(.011)    .487(.012)*   .492(.009)    .492(.008)

* = better method in terms of the test error


Table 6: The mean of $L_1$ test error and standard deviation (RBF, robust SVR).

Data   C-SVR        CCCP-θ        OP-θ          CCCP-s        OP-s
D1     .077(.049)   .069(.054)    .065(.056)*   .070(.053)    .051(.040)*
D2     .357(.059)   .346(.045)    .339(.045)*   .332(.038)    .327(.040)*
D3     .337(.052)   .299(.021)*   .302(.019)    .296(.022)    .295(.022)*
D4     .390(.046)   .350(.025)    .349(.023)*   .357(.022)    .343(.024)*
D5     .513(.024)   .519(.028)    .504(.018)*   .515(.024)    .503(.019)*
D6     .641(.028)   .640(.015)    .635(.017)*   .634(.022)    .631(.017)*
D7     .671(.011)   .669(.009)    .669(.007)    .674(.011)    .671(.009)*
D8     .528(.027)   .504(.027)    .496(.024)*   .511(.018)    .510(.020)*
D9     .488(.012)   .490(.016)    .486(.012)*   .484(.013)    .482(.014)*
D10    .198(.015)   .198(.027)    .196(.025)*   .194(.015)    .189(.017)*
D11    .456(.016)   .441(.005)    .441(.006)    .444(.015)*   .446(.015)

* = better method in terms of the test error


5.3 Computation Time 

Finally, we compared the computational costs of the entire model-building process of each method. The
results are shown in Figure 6. Note that the computational cost of the OP algorithm does not depend on
the number of hyper-parameter candidates of $\theta$ or $s$, because the entire path of local solutions has already
been computed with infinitesimal resolution in the homotopy computation. On the other hand, the
computational cost of CCCP depends on the number of hyper-parameter candidates. In our implementation
of CCCP, we used the warm-start approach, i.e., we initialized CCCP with the previous solution to efficiently
compute a sequence of solutions. The results indicate that the proposed OP algorithm enables stable and
efficient control of robustness, while CCCP suffers from a trade-off between model selection performance and
computational costs.
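
One way to see why adding candidates costs OP nothing extra is that, on a segment where the path is piecewise-linear, the solution at any $\theta$ follows from the stored break-points by interpolation. The sketch below assumes that the coefficient vectors at the break-points have been stored and that no discontinuous step occurs between the stored points; the function name `solution_at` and the storage layout are illustrative.

```python
import numpy as np


def solution_at(theta, breakpoints, alphas):
    """Coefficients at an arbitrary theta from stored break-point solutions.

    breakpoints : (m,) theta values of the stored break-points (any order)
    alphas      : (m, n) coefficient vectors, one row per break-point
    Valid only where the path is continuous and piecewise-linear.
    """
    bp = np.asarray(breakpoints, dtype=float)
    A = np.asarray(alphas, dtype=float)
    order = np.argsort(bp)  # np.interp needs increasing x-coordinates
    bp, A = bp[order], A[order]
    return np.array([np.interp(theta, bp, A[:, j]) for j in range(A.shape[1])])
```

Evaluating more candidates therefore only adds cheap interpolations, whereas every additional CCCP candidate requires another non-convex optimization run.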
(a) Elapsed time for CCCP and proposed OP (linear, robust SVC)
(b) Elapsed time for CCCP and proposed OP (RBF, robust SVC)
(c) Elapsed time for CCCP and proposed OP (linear, robust SVR)
(d) Elapsed time for CCCP and proposed OP (RBF, robust SVR)

Figure 6: Elapsed time when the number of $(\theta, s)$-candidates is increased. Changing the number of hyper-parameter
candidates affects the computation time of CCCP, but not OP because the entire path of solutions
is computed with infinitesimal resolution.


6 Conclusions 


In this paper, we proposed a novel robust SVM learning algorithm based on the homotopy approach that
allows efficient computation of the sequence of local optimal solutions when the influence of outliers is
gradually decreased. The algorithm is built on our theoretical findings about the geometric property and
the optimality conditions of local solutions of robust SVM. Experimental results indicate that our algorithm
tends to find better local solutions, possibly due to the simulated annealing-like effect and the stable control of
robustness. An important direction for future work is to adopt scalable homotopy algorithms [28] or approximate
parametric programming algorithms [34] to further improve the computational efficiency.
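A Proof of Theorem 3

Although $f^*_{\mathcal{P}}$ is a feasible solution, it is not a local optimum for $\theta \in [0,1)$ and $s < 0$ because

$$\alpha_i < C\theta \quad \text{for } i \in \tilde{\mathcal{I}} \cap \mathcal{O}, \tag{20a}$$

$$\alpha_i > C \quad \text{for } i \in \tilde{\mathcal{O}} \cap \mathcal{I}, \tag{20b}$$

violate the KKT conditions (10) for $\tilde{\mathcal{P}}$. Since this feasibility and sub-optimality indicate that

$$J_{\tilde{\mathcal{P}}}(f^*_{\tilde{\mathcal{P}}}; \theta) < J_{\tilde{\mathcal{P}}}(f^*_{\mathcal{P}}; \theta), \tag{21}$$

we arrive at (12). Q.E.D.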


B Proof of Theorem 4 

Sufficiency: If (13e) is true, i.e., if there are no instances with $y_i f^*_{\mathcal{P}}(x_i) = s$, then any convex problem
defined by a different partition $\tilde{\mathcal{P}} \neq \mathcal{P}$ does not have feasible solutions in the neighborhood of $f^*_{\mathcal{P}}$. This means
that if $f^*_{\mathcal{P}}$ is a conditionally optimal solution, then it is locally optimal. (13a)-(13d) are sufficient for $f^*_{\mathcal{P}}$ to
be conditionally optimal for the given partition $\mathcal{P}$. Thus, (13) is sufficient for $f^*_{\mathcal{P}}$ to be locally optimal.

Necessity: From Theorem 3, if there exists an instance such that $y_i f^*_{\mathcal{P}}(x_i) = s$, then $f^*_{\mathcal{P}}$ is feasible
but not locally optimal. Hence (13e) is necessary for $f^*_{\mathcal{P}}$ to be locally optimal. In addition, (13a)-(13d) are
also necessary for local optimality, because every locally optimal solution is conditionally optimal for the
given partition $\mathcal{P}$. Thus, (13) is necessary for $f^*_{\mathcal{P}}$ to be locally optimal.

Q.E.D. 

C Implementation of D-step 

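In the D-step, we work with the following convex problem

$$f^*_{\tilde{\mathcal{P}}} := \arg\min_{f \in \mathrm{pol}(\tilde{\mathcal{P}}; s)} J_{\tilde{\mathcal{P}}}(f; \theta), \tag{22}$$

where $\tilde{\mathcal{P}}$ is updated from $\mathcal{P}$ as in (11).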

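Let us define a partition $\Pi := \{\mathcal{R}, \mathcal{E}, \mathcal{L}, \mathcal{I}', \mathcal{O}', \mathcal{O}''\}$ of $\mathbb{N}_n$ such that

$$i \in \mathcal{R} \Rightarrow y_i f(x_i) > 1, \tag{23a}$$

$$i \in \mathcal{E} \Rightarrow y_i f(x_i) = 1, \tag{23b}$$

$$i \in \mathcal{L} \Rightarrow s < y_i f(x_i) < 1, \tag{23c}$$

$$i \in \mathcal{I}' \Rightarrow y_i f(x_i) = s \text{ and } i \in \tilde{\mathcal{I}}, \tag{23d}$$

$$i \in \mathcal{O}' \Rightarrow y_i f(x_i) = s \text{ and } i \in \tilde{\mathcal{O}}, \tag{23e}$$

$$i \in \mathcal{O}'' \Rightarrow y_i f(x_i) < s. \tag{23f}$$
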

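If we write the conditionally optimal solution as

$$f^*_{\tilde{\mathcal{P}}}(x) := \sum_{j \in \mathbb{N}_n} \alpha^*_j y_j K(x, x_j), \tag{24}$$

$\{\alpha^*_j\}_{j \in \mathbb{N}_n}$ must satisfy the following KKT conditions

$$y_i f^*_{\tilde{\mathcal{P}}}(x_i) > 1 \Rightarrow \alpha^*_i = 0, \tag{25a}$$

$$y_i f^*_{\tilde{\mathcal{P}}}(x_i) = 1 \Rightarrow \alpha^*_i \in [0, C], \tag{25b}$$

$$s < y_i f^*_{\tilde{\mathcal{P}}}(x_i) < 1 \Rightarrow \alpha^*_i = C, \tag{25c}$$

$$y_i f^*_{\tilde{\mathcal{P}}}(x_i) = s, \ i \in \tilde{\mathcal{I}} \Rightarrow \alpha^*_i \ge C, \tag{25d}$$

$$y_i f^*_{\tilde{\mathcal{P}}}(x_i) = s, \ i \in \tilde{\mathcal{O}} \Rightarrow \alpha^*_i \le C\theta, \tag{25e}$$

$$y_i f^*_{\tilde{\mathcal{P}}}(x_i) < s \Rightarrow \alpha^*_i = C\theta. \tag{25f}$$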
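
The KKT conditions above lend themselves to a simple numerical check. The sketch below lists the instances whose coefficient is inconsistent with its margin. It reads the boundary cases at margin $s$ as non-strict inequalities, assumes $s < 1$, and uses a hypothetical function name `kkt_violations`, a hypothetical inlier mask `in_I` (consulted only for instances exactly at margin $s$), and an arbitrary tolerance.

```python
def kkt_violations(margins, alpha, in_I, s, C, theta, tol=1e-8):
    """Indices whose (margin, alpha) pair violates conditions (25a)-(25f).

    margins[i] = y_i f(x_i), alpha[i] = coefficient of instance i,
    in_I[i]    = True if i is assigned to the inlier set of the new partition.
    """
    bad = []
    for i, (m, a) in enumerate(zip(margins, alpha)):
        if m > 1 + tol:                      # (25a)
            ok = abs(a) <= tol
        elif abs(m - 1) <= tol:              # (25b)
            ok = -tol <= a <= C + tol
        elif m > s + tol:                    # (25c)
            ok = abs(a - C) <= tol
        elif abs(m - s) <= tol:              # (25d) / (25e)
            ok = (a >= C - tol) if in_I[i] else (a <= C * theta + tol)
        else:                                # (25f)
            ok = abs(a - C * theta) <= tol
        if not ok:
            bad.append(i)
    return bad
```

An empty return value means that all instances satisfy the conditions up to the chosen tolerance, which is the state the D-step aims to restore.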


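At the beginning of the D-step, $f_{\mathcal{P}}(x_i)$ violates the KKT conditions by

$$\Delta f_i := y_i \begin{bmatrix} K_{i,\Delta_{\mathcal{I} \to \mathcal{O}}} & K_{i,\Delta_{\mathcal{O} \to \mathcal{I}}} \end{bmatrix} \begin{bmatrix} \alpha^{(\mathrm{bef})}_{\Delta_{\mathcal{I} \to \mathcal{O}}} - \mathbf{1} C \theta \\ \alpha^{(\mathrm{bef})}_{\Delta_{\mathcal{O} \to \mathcal{I}}} - \mathbf{1} C \end{bmatrix},$$

where $\alpha^{(\mathrm{bef})}$ is the corresponding $\alpha$ at the beginning of the D-step, while $\Delta_{\mathcal{I} \to \mathcal{O}}$ and $\Delta_{\mathcal{O} \to \mathcal{I}}$ denote the
difference between $\tilde{\mathcal{P}}$ and $\mathcal{P}$, defined as

$$\Delta_{\mathcal{I} \to \mathcal{O}} := \{ i \in \mathcal{I} \mid y_i f_{\mathcal{P}}(x_i) = s \},$$

$$\Delta_{\mathcal{O} \to \mathcal{I}} := \{ i \in \mathcal{O} \mid y_i f_{\mathcal{P}}(x_i) = s \}.$$

Then, we consider another parametrized problem with a parameter $\mu \in [0,1]$:

$$f_{\tilde{\mathcal{P}}}(x_i; \mu) := f_{\tilde{\mathcal{P}}}(x_i) + \mu \Delta f_i \quad \forall i \in \mathbb{N}_n.$$

In order to always satisfy the KKT conditions for $f_{\tilde{\mathcal{P}}}(x_i; \mu)$, we solve the following linear system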




Q A,Ao-ji 


Qa,cXC -Q a & ,\C0 


a? ef) -ICO 
cA bef) 1C 


Ah 


where $\mathcal{A} := \{\mathcal{E}, \mathcal{I}', \mathcal{O}'\}$. This linear system can also be solved by using piecewise-linear parametric
programming while the scalar parameter $\mu$ is continuously moved from 1 to 0.

In this parametric problem, we can show that $f_{\tilde{\mathcal{P}}}(x_i; \mu) = f_{\mathcal{P}}(x_i)$ if $\mu = 1$ and $f_{\tilde{\mathcal{P}}}(x_i; \mu) = f_{\tilde{\mathcal{P}}}(x_i)$ if
$\mu = 0$ for all $i \in \mathbb{N}_n$.

Since the numbers of elements in $\Delta_{\mathcal{I} \to \mathcal{O}}$ and $\Delta_{\mathcal{O} \to \mathcal{I}}$ are typically small, the D-step can be efficiently
implemented by a technique used in the context of incremental learning [33].